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Abstract 

In  statistical  models  involving  constrained  or  missing  data,  likelihoods  containing  integrals 
emerge.  In  the  case  of  both  constrained  and  missing  data,  the  result  is  a  ratio  of  integrals,  which 
for  multivariate  data  may  defy  exact  or  approximate  analytic  expression.  Seeking  maximum  like¬ 
lihood  estimates  in  such  settings,  we  propose  Monte  Carlo  approximants  for  these  integrals,  and 
subsequently  maximize  the  resulting  approximate  likelihood.  Iteration  of  this  strategy  expedites 
the  maximization,  while  the  Gibbs  sampler  is  useful  for  the  required  Monte  Carlo  generation.  As 
a  result,  we  handle  a  class  of  models  broader  than  the  customary  EM  setting  without  using  an 
EM-type  algorithm.  Implementation  of  the  methodology  is  illustrated  in  two  numerical  examples. 

1  Introduction 

In  challenging  parametric  modeling  settings  the  maximum  likelihood  estimator  is  generally 
the  estimator  of  choice.  This  follows  from  both  foundational  considerations  (e.g.  the  Likelihood 
Principle)  as  well  as  practical  ones  (e.g.  good  large  sample  behavior  under  mild  conditions).  Here  we 
propose  a  Monte  Carlo  approach  for  calculation  of  maximum  likelihood  estimators  which  handles 
a  range  of  previously  inaccessible  problems. 

The  context  we  have  in  mind  results  in  a  likelihood  function  which  is  unavailable  explicitly. 
Some  likelihoods  of  this  type  have  been  analyzed  using  the  EM  algorithm  (Dempster,  Laird  and 
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Rubin,  1977).  As  we  clarify  later  in  this  section,  however,  the  class  of  models  we  envision  yields  a 
likelihood  which  cannot  be  handled  by  the  customary  version  of  this  algorithm. 

Though  we  present  our  method  in  terms  of  general  multivariate  joint  distributions,  all  of  our 
illustrations  and  data  examples  assume  an  underlying  exponential  family  of  models.  This  is  because 
the  behavior  of  the  likelihood  surface  and  hence  the  properties  of  the  MLE  are  perhaps  best 
understood  in  such  families  (see  e.g.  Bamdorif-Nieisen,  1978;  Brown,  1986;  Jacobsen,  1988,  and 
references  therein).  We  do  not  address  theoretical  concerns  regarding  e.g.  existence,  uniqueness, 
consistency,  or  asymptotic  normality.  Rather,  we  offer  a  method  for  obtaining  the  maximum  of 
the  likelihood  when  it  is  reasonably  well  behaved.  Problems  and  remedies  associated  with  poony 
behaved  likelihoods  are  well  discussed  in  the  literature  and  apply  to  our  approach  as  well.  In 
particular,  the  use  of  multiple  starting  points  with  a  given  maximization  routine  often  helps  avoid 
convergence  to  local,  rather  than  global,  maxima. 

Our  models  assume  the  observed  data  to  be  constrained  in  some  way.  We  also  aiiow  for  the 
possibility  of  missing  data  with  constraints  upon  the  entire  set  of  variables,  both  observed  ana 
unobserved.  As  a  general  version  of  this  setting  let  x  denote  the  fc-dimensionai  observed  data 
vector  and  9  the  p-dimensionai  parameter  vector.  We  suppose  that  the  likelihood  takes  the  :orm 


1(0:  x; 


cifx:  01 
c2(,0) 


cu 


x:  9)  —  I  f<\.y0)dy 
JvtCx; 


/  is  a  parametric  family  of  densities,  and 


c2(0)  =  f  C;(x;  0)dx  . 

JXfzS 
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That  is,  as  a  function  of  x,  L(9;x)  is  a  normalized  density  function.  Here  y  is  viewed  as  a  1- 
dimensional  missing  (or  latent)  data  vector  constrained  by  the  observed  x  to  the  set  C(x),  with 
the  observed  x  itself  constrained  to  a  set  5. 

In  the  case  where  there  is  no  missing  data  we  take  ci(x;  9)  to  be  a  parametric  family  of  densities 
for  x  with  c2(9)  being  the  normalizing  constant  arising  from  the  restriction  of  x  to  5.  Computation 
of  the  function  c2(9)  then  requires  a  ^-dimensional  integration  which  we  presume  cannot  be  carried 
out  explicitly.  In  fact  with  k  large  and  5  awkward,  evaluation  of  c2(9)  at  a  particular  90  may 
defy  exact  or  approximate  analytic  numerical  integration.  Hence,  we  are  drawn  to  Monte  Carlo 
approaches.  In  the  case  of  latent  or  missing  data  y,  computation  of  the  function  cj(x;0)  requires 
a  t-dimensional  integration  over  a  constrained  set  C(x),  which  again  we  cannot  carry  out  explictly. 
Moreover,  c2(9 )  now  requires  a  (fc  -f-  f)-dimensional  integration. 

Of  course,  in  principle  one  could  attempt  a  grid  search  for  the  maximizing  9  in  (1).  That  is,  at 
a  given  9 ,  perform  Monte  Carlo  integrations  for  c*  and  c2  to  obtain  L,  and  then  search  through  the 
space  of  9  for  a  maximum  L.  This  is  in  fact  the  approach  that  emerges  in  several  papers  horn  the 
econometrics  literature  on  simulated  moments  estimation;  see  for  example  McFadden  (1989)  and 
Pakes  and  Pollard  (1989).  The  primary  concern  of  these  papers  appears  to  be  the  accuracy  and 
precision  of  the  simulator  carrying  out  the  integrations,  rather  than  efficient  maximization  using 
such  a  simulator.  These  papers  typically  assume  the  observed  data  x  to  be  a  deterministic  function 
of  the  latent  data  y,  thus  simplifying  the  structure  of  the  integrals  in  (1).  Constrained  multivariate 
normal  models  for  y  are  presumed,  resulting  in  tailored  simulators  inappropriate  for  the  broader 
class  of  statistical  models  we  envision  (see  Section  2).  Moreover,  when  p  is  large  a  naive  grid  search 
for  the  MLE  9  may  be  impractical;  for  smaller  p  our  proposed  method  is  faster. 

The  EM  algorithm  is  a  widely  used  tool  for  handling  incomplete  data  problems.  It  cannot, 
however,  accommodate  constraints  on  the  observed  data.  That  is,  it  presumes  that  that  ci(x;0) 
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is  itself  a  normalized  density  in  x,  so  that  c2(0)  in  (1)  disappears.  To  clarify,  recall  the  general 
version  of  the  EM  algorithm  as  presented  in  Dempster,  Laird  and  Rubin  (1977).  At  the  Ith  stage, 
given  0  =  the  E-step  computes  a  function  Q{0'\9  =  9^)  which  the  M-step  then  maximizes 
over  9' .  In  the  case  of  (1), 


Q(0'\0)  = 


Ic(x)  /(x>  y  lfi)  lo6  /(*»  yl^)rfy 

ci(x;  9) 


log  C2(®/)  . 


But  since  this  expression  still  involves  ct  and  c2,  it  is  no  easier  to  work  with  than  (1).  In  summary, 
if  the  likelihood  is  of  the  form  (1)  we  need  to  approximate  one  or  both  of  ci(x;0)  and  c2(0)  as 
functions  of  9.  While  Monte  Carlo  integration  seems  to  be  a  natural  tool  in  this  regard,  it  is  not 
immediately  clear  how  to  proceed. 

We  develop  Monte  Carlo  approximants  following  importance  sampling  ideas  proposed  in  Geyer 
and  Thompson  (1992).  In  their  setting  (an  autologistic  model),  data  vectors  Xi,...,Xn  come 
from  an  exponential  family  model  where  only  the  nonnorm alized  form  of  the  exponential  ker¬ 
nel,  exp(0'T(x)),  is  specified.  That  is,  the  vector  T  is  chosen  such  that  the  sufficient  statistic, 
£(*=I  T(x^),  is  a  suitable  summary  of  the  data.  Hence  the  normalizing  constant,  a  function  of  9 
and  therefore  needed  for  the  ML  estimation,  is  unknown  and  requires  integration  over  X  to  compute. 
Geyer  and  Thompson  introduce  a  Monte  Carlo  approximant  for  this  function,  as  well  as  an  iterative 
approach  for  carrying  out  the  maximization.  We  generalize  their  ideas  beyond  approximation  of 
the  normalizing  function  to  a  broad  class  of  constrained  or  missir.g  data  problems. 

The  format  for  the  remainder  of  the  paper  is  as  follows.  In  Section  2  we  offer  a  collection  of 
motivating  examples  where  likelihoods  of  the  form  (1)  arise.  In  Section  3  we  formalize  the  Monte 
Carlo  approach.  Finally,  Section  4  presents  two  datasets  for  which  models  of  the  form  )  arc  used 
and  the  approach  of  Section  3  is  carried  out. 
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2  Illustrative  Examples 

In  order  to  demonstrate  tie  range  of  application  of  our  approach,  in  the  following  subsections 
we  present  three  situations  where  the  form  (1)  arises.  Two  of  these  are  analyzed  more  fully  using 
appropriate  datasets  in  Section  4.  Other  illustrations  include  multivariate  biased  sampling  settings, 
and  the  analysis  of  adaptive  patient  followup  schemes  for  clinical  trials  data. 

2.1  Patterned  covariance  models 

An  elementary  illustration  is  provided  by  a  constrained  vector  x  from  a  multivariate  normal 
having  mean  0  and  covariance  E(0),  assumed  to  be  a  patterned  matrix.  Such  structure  arises  in 
variance  components  models,  time  series  models,  and  moving  average  processes.  Particular  forms 
include  (a)  £„(0)  =  a2,  £tJ(0)  =  pa2,  (b)  £t,-(0)  =  £r2p!‘_jl,  (c)  2»,-(0)  =  a2b and  (d) 
£,j (0)  =  (m  —  |x  -  j\)a2,  |i  -  j j  <  m;  E,j(0)  =  0,  \i  -  j\  >  m  .  See  also  Rubin  and  Szatrowski 
(1982)  in  this  regard.  Constraints  on  x  might  be  |x,|  <  c,-,  i  =  1, .  .  or  Xi  <  x2  <  •  •  •  <  Xfe- 
In  the  equicorrelated  case  (i)  under  say  the  latter  constraints,  ci(x;  0)  is  precisely  the  multivariate 
normal  density  Nk(0,  £(#))  with 

c2(0)=  /  Afc(x!O,S(0))d=1d=2-..d=fc. 

Jx\<x,<-<Zk 

Unless  k  is  very  small,  calculation  of  c2  at  a  given  9o  is  only  feasible  using  Monte  Carlo  methods. 

2.2  Categorical  data  models 

Consider  the  case  of  truncated  multinomial  trials.  For  instance,  it  is  sometimes  the  case  that 
the  observation  of  a  zero  count  is  truncated.  We  dichotomize  according  to  presence  or  absence 
and  then,  if  present,  record  how  many.  If  we  assume  cell  counts  arise  from  independent  Poisson 
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distributions  in  this  fashion,  the  conditional  distribution  of  cell  counts  given  the  total  number 
becomes  a  truncated  multinomial.  More  generally  for  a  fixed  number  of  trials,  n,  suppose  the  itk 
cell  is  restricted  to  have  at  least  r*  observations,  i  =  1, . . . ,  k.  Let  x  =  (xx, . . . ,  where  xt-  denotes 
the  count  in  the  ith  cell,  with  x,-  >  rt-  for  each  i,  and  g,-  is  the  probability  associated  with  the  ith 
cell.  Then,  with  9  =  ci(x;0)  =  nl flLi gfV*;-1  and  c2(0)  =  £sc i(*;0)  where  now 

5  =  {x:  each  x*  is  integer- valued,  xt-  >  r^,  and  ]T*=1  s*  —  n}-  Other  variations  include  the  case 
where  a  particular  multinomial  cell  is  known  to  supply  the  largest  count  or  where  the  counts  are 
constrained  to  increase  up  to  a  particular  cell  and  then  decrease  thereafter. 

2.3  Compositional  data  models 

Frequently  samples  are  taken  such  that  each  observation  is  a  vector  whose  components  sum  to 
1.  Examples  include  land  samples  described  in  terms  of  proportions  of  different  types  of  vegetation, 
soil  samples  described  by  proportion  of  chemical  content,  and  rock  samples  described  by  proportion 
of  mineral  content.  Such  data  is  referred  to  as  compositional  data  (Aitchison,  1986).  Distributional 
models  are  specified  on  the  simplex  {p  =  (p\,p2,  •  ■  •  ,Pk)  -  Pi  >  O.HLxFt'  =  !}• 

Let  /(pjfl)  denote  a  parametric  specification  of  the  joint  density  for  p.  The  most  obvious  choice 
of  /,  the  Dirichlet  family,  is  usually  undesirable  since  it  forces  an  assumption  of  negative  correlation 
amongst  every  ( pi,pj )  pair,  as  well  as  certain  conditional  independencies.  Instead,  baseline  logit 
transformations  zt-  =  log(pi/pk)  are  often  adopted,  with  z  =  (xj, . . . ,  Zk-i)  assumed  to  follow,  say,  a 
Nk- i(m,  £)  distribution,  so  that  9  =  [n,  E).  The  ( k  -  1)  logits  uniquely  determine  the  composition. 

The  mean  /x  is  often  expressed  as  a  parametric  function  of  explanatory  variables.  Of  particular 
interest  is  the  covariance  matrix  E,  since  the  nature  of  covariation  between  proport. ons  is  a  primary 
research  question.  Usually  MLE’s  are  sought,  so  that  given  samples  p< ,  t  =  1,  . . ,  n,  we  would 
convert  to  zt  and  then  obtain  the  customary  MLE  for  /x  and  E. 
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Now  suppose  there  are  constraints  on  the  p’s.  For  instance,  we  might  know  that  the  first 
classification  is  most  common,  i.e.  pi  >  Pi,  i  =  2,..  or  we  might  know  that  the  classifications 
are  in  decreasing  order  of  prevalence,  i.e.  pi  >  Pz  >  •  *  •  >  p*.  On  the  logit  scale  these  convert  to 
zi  ^  *  ==  2, . . ., k,  and  Z\  >  >  •••  >  Zk-i  >  0,  respectively.  If  S  denotes  these  constraints 
“d  i\G)  denotes  the  density  of  z  then  c2(0)  =  Js  f{zi, .  ..,2fc_i|0)  dzx  •  •■dzk_l. 

Next  imagine  that,  as  often  happens,  the  k 01  classification  is  a  leftover,  or  “other"  category. 
Unfortunately  what  classifcations  comprise  “other"  may  vary  across  data  collection  sources.  That 
is,  we  can  think  of  p  as  the  most  refined  classification  vector,  but  we  actually  observe  q’s  where 
components  of  p  are  collapsed.  Then  the  likelihood  for  the  observed  data  qi, . .  • ,  qn  takes  the  form 

L(0;qi,...,qn)  =  TT  /  f(ptu---,Ptk\9)dpn---dptk  ,  (3) 

<=1  dC(q,) 

where  C(qt)  reflects  the  collapsing  of  pt  to  give  qt.  Whether  on  the  p  scale  or  the  z  scale,  expression 
(3)  is  of  the  form  (2).  If  we  also  incorporate  the  previouly  described  restrictions  on  p,  the  likelihood 
takes  the  most  general  form  (1). 

3  The  Monte  Carlo  approximant  approach 

Returning  to  (1),  observe  that  we  may  write 

cl(x;  B)  =  c,(x;  ft,) .  V*r)  (/c(x)  I*.  **)  .  C«> 

and  similarly 

c’m  =  c’w •  (XXW  n^fix'yle°)dydx)  (LL  /(x’y Wdydx)  •  (5) 
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Thus  if  {w;,  j  =  are  drawn  from  $(y|x,  0O),  the  conditional  distribution  of  y  given  x 

and  9q  restricted  to  C(x),  then  a  Monte  Carlo  approximant  for  (4)  is 


/(x,w;|g) 
/(x,w;|0o)  ' 


(6) 


If  instead  {x^,  j  =  1  are  drawn  from  p(x|0o),  the  distribution  of  x  given  90  restricted  to 

S,  and  subsequently  for  each  x*  a  y  *  is  drawn  from  j(y  |x*,  9q),  then  a  Monte  Carlo  approximant 


for  (5)  is  given  by 


/(*j,y;|g) 

/(x;,y;|0o)  ' 


(7) 


Hence  we  may  approximate  the  log  likelihood  log£(0;x)  by  logcx(x;0)  -  logc2(0).  This  in  turn 
implies  that  an  approximate  MLE  9  may  be  found  by  maximizing 


log  y  _  Io,  y  J&yW 

/(*.w;i0o)  zfr[  fix', y;\90)  ’ 


where  we  have  ignored  terms  free  of  9.  Thus  using  the  approximants  (6)  and  (7),  we  have  replaced 
an  intractable  form  (1)  with  an  explicit  form  (8).  Note  that  the  terms  free  of  9  involve  the  unknown 
quantities  cx(x;  0O)  and  c2(0o),  but  fortunately  these  need  not  be  evaluated.  Henceforth,  we  shall 
refer  to  the  method  of  finding  9  via  maximization  of  (8)  as  the  MCMLE  algorithm.  Notice  that  if 


there  is  no  missing  data  y,  a  Monte  Carlo  approximant  is  not  needed  for  cx(x;0)  in  (1),  and  (8) 


simplifies  to 


A  byproduct  of  the  explicit  Monte  Carlo  approximant  is  the  possibility  of  approximation  to 
the  asymptotic  covariance  matrix  of  the  MLE.  If  we  calculate  (either  analytically  or  numerically) 
the  Hessian  matrix  of  mixed  partial  derivatives  of  (8)  with  respect  to  9  evaluated  at  the  MLE,  say 


9 
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E(Q),  then  [— -S'(^)]-1  provides  a  rough  indication  of  the  uncertainty  associated  with  6. 

Implementation  of  expression  (8)  poses  two  cha’lenges:  carrying  out  the  required  sampling  to 
create  the  Monte  Carlo  approximants,  and  maximizing  the  resulting  expression.  With  regard  to 
the  first  challenge,  the  sampling  for  (6)  and  (7)  requires  making  draws  from  a  joint  distribution 
whose  form  is  known  perhaps  only  up  to  normalizing  constant  and  which  is  confined  to  a  specified 
set.  If  the  joint  density  for  x  and  y  is  of  the  form  /(x,  y|0),  then  in  (6),  given  x  and  0O  we  need  to 
sample  from  /(y|x,0o)  oc  /(x,-yj#o)  restricted  to  C(x).  In  (7)  we  need  to  sample  from  /(x,y|0o) 
restricted  to  the  set  {(x,y) :  x  €  5  and  y  £  C(x)}. 

In  some  cases,  simple  rejection  sampling  (e.g.  generating  y  from  /(y|x,  #o)  and  retaining  it  if  it 
belongs  to  <7(x)),  though  inefficient,  will  be  easiest.  Alternatively,  Markov  chain  Monte  Carlo  using 
the  Gibbs  sampler  (see  e.g.  Gelfand  and  Smith,  1990;  Tierney,  1991)  is  attractive  here  since  required 
sampling  is  from  complete  conditional  distributions,  all  of  which  are  proportional  to  the  joint  density 
/(x,y|0).  Let  us  adopt  the  notation  y_j  =  (yx, . . .,  y»-i,y,+i, .  • . ,  yt),  with  a  similar  definition  for 
x_,-.  Then  in  (6)  we  need  to  draw  from  /(yi|y-i,x,0),  t  =  l,...,i,  appropriately  restricted.  In 
(7)  we  need  also  to  draw  from  /(xi|x_i,y,0),  i  =  1  again  appropriately  restricted.  At  a 

particular  yi  or  ij  the  constraint  sets  must  be  viewed  in  univariate  cross  sections  given  the  other 
variates.  This  typically  results  in  restriction  to  an  interval  or  a  set  of  intervals.  Recent  work  of 
Gelfand,  Smith,  and  Lee  (1992)  is  pertinent  here  in  discussing  one-for-one  sampling  from  such 
univariate  distributions.  Though  their  work  is  in  the  Bayesian  framework  where  integration  and 
sampling  is  over  the  parameter  space,  it  is  equally  well  applicable  for  our  situation  where  integration 
and  sampling  is  over  the  data  space. 

In  the  approximations  (6)  and  (7),  /(•,  -|0o)  plays  the  role  of  an  importance  sampling  density. 
More  generally,  for  instance,  the  w*’s  in  equation  (6)  might  be  drawn  from  any  importance  sampling 
density  h(y|x)  that  is  appropriately  restricted  to  C(x).  But  no  such  single  density  could  possibly 
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perform  well  for  all  integrands  over  the  range  of  possible  values  for  6  On  the  other  hand,  adopting 
an  h  that  changed  with  9  would  require  a  simulation  for  each  0,  rendering  any  naive  MLE  grid 
search  algorithm  infeasible. 

How  do  we  avoid  this  problem?  Our  selection  of  ^(y|x,  0O)  for  fc(y|x)  suggests  an  iterative 
approach  to  create  a  sequence  of  importance  sampling  densities  that  improves  relative  to  the  density 
/  at  the  MLE  9.  This  approach  eliminates  the  need  for  grid  search  and  also  the  costly  set-up 
time  in  developing  appropriate  importance  sampling  densities.  The  idea  follows  from  Geyer  and 
Thompson  (1992),  who  observed  that  starting  at  some  0(°)  if  we  maximize  (8)  to  obtain  9  we  can 
take  9^  =  9  and  rerun  the  entire  procedure,  resulting  in  a  new  9  and  an  iterative  version  of 
the  procedure.  To  understand  the  value  of  iteration  we  need  to  distinguish  the  maximization  of 
(8)  from  the  maximization  of  (1).  Expression  (8)  is  an  approximation  to  (1)  which  depends  upon 
the  accuracy  of  the  approximations  (6)  and  (7).  If  we  view  /(x,y(0o)  as  an  importance  sampling 
density  for  /(x,y|0),  then  for  a  given  B\  and  B ?  the  Monte  Carlo  integration  for  Ci(x;  9)  and  C2(0) 
improves  as  /(x,y|0o)  gets  “closer”  to  /(x,y|0),  i.e.  as  9q  gets  closer  to  9.  Thus  the  sequence 
{0^}  produced  by  iteration  should  be  getting  closer  to  the  true  0  which  maximizes  (1).  Since  our 
objective  is  only  to  insure  a  good  Monte  Carlo  approximant,  we  need  not,  however,  take  more  than 
a  few  iterations  to  obtain  0^  in  the  vicinity  of  the  true  0.  At  this  point,  one  final  iteration  with 
Bi  and  Bz  very  large  (say  10,000)  will  produce  an  accurate  final  estimate.  Thus  through  a  small 
number  of  iterations  we  achieve  an  efficient  and  broadly  applicable  maximization  strategy. 

4  Numerical  Examples 

4.1  Truncated  Correlated  Normal 

Consider  the  eauicorrelated  k- variate  normal  distribution  described  in  Subsecricn  ».l  where 
£,i  =  1,  E,j  =  p  for  i  £  j,  subject  to  truncation  to  the  set  5  =  {x  :  max|ii|  <  L}  for  some 
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L>0.  We  seek  the  MLE  oid~p.  Notice  that  direct  standardization  of  this  likelihood  involves  a 
dimensional  integral  for  each  candidate  p  value,  whereas  our  Monte  Carlo  approach  requires  only 
the  generation  of  x*’s  from  the -truncated  correlated  normal,  given  the  current  approximation  p0. 

To  carry  out  the  required  sampling,  we  note  that  Ejt1  =  at  and  Ejt1  =  bk  for  i  ^  j,  where 
afc  =  [-(&  -  2 )p  -  l]/[(fc  -  l)p2  -  (k  -  2 )p  -  1]  and  bk  =  p/[(k  -  l)p2  ~(k~  2 )p  -  1]  (see  e.g.  Rao, 
1973).  Hence  for  any  i , 

^(1  t4  a  A  ^  ~  l(-L,L)(Ii)  >  (10) 

where  =  1  —  (1  —  p)ak-\.  Generation  of  the  necessary  samples  may  now  proceed  by  a  Gibbs 

sampling  algorithm.  That  is,  we  successively  sample  from  the  complete  conditional  distributions  in 
(10),  updating  the  value  of  zij  35  we  g°-  After  a  suitably  large  number  of  “bum-in”  iterations 
Ny  the  x*  values  emerging  from  the  sampler  are  approximately  distributed  according  to  their  true 
joint  distribution. 

With  regard  to  implementation,  we  first  choose  a  starting  value  for  x,*-,  then  run  the 
substitution  sampling  chain  for  N  “bum-in”  iterations  to  essentially  reach  the  chain’s  ergodic 
distribution,  and  finally  continue  for  an  additional  Bz  iterations,  now  retaining  the  x*  values 
generated  for  use  in  a  Monte  Carlo  approximant.  Values  obtained  in  this  way  will  of  course  be 
serially  correlated,  and  some  authors  thus  recommend  retaining  only  every  Mth  sample.  Even  for 
M  —  1,  however,  the  xj’s  will  still  be  from  the  correct  ergodic  distribution;  taking  f?2  large  will  also 
help  to  ameliorate  this  problem.  Proper  selection  of  N  (ascertaining  "convergence”  of  the  sampler) 
is  another  important  issue,  and  a  source  of  much  recent  research  interest  (see  for  example  Gelman 
and  Rubin,  1992;  Raftery  and  Lewis,  1992).  In  this  case,  our  experience  with  normal  sampling 
models  convinced  us  that  taking  N  =  20  would  constitute  ample  burn-in. 
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Using  our  Gibbs-sampled  x*’s,  the  Monte  Carlo  approximant  to  the  log  likelihood  (9)  for  this 
model  becomes 


2  sis  [  2xS  x  log £,=l  [is-'jwi 

«  -Ix'S-ix-logEf^acptlx-'tSo'1  -  2-')xJ] 

-  log  Efi,  «P  {1  eLi  [(*?’  -  W*&G±,  *«)  +  (40)  -  n  -  4°’  +  }  , 


where  oj.  and  bk  are  defined  a a  above,  and  4°'  and  are  defined  similarly  but  depend  on  the 
fixed  value  po  instead  of  the  unknown  p.  Note  that  our  calculations  feature  two  levels  of  iteration 
(Gibbs  sampling  within  our  MCMLE  iterative  framework). 

We  apply  the  above  method  to  the  following  vector  of  k  =  10  observations:  x'  =  (-0.167,-0.934, 
0.175,-0.349,-1.012,-0.378,-0.720,-1.208,-0.664,-1.435).  This  x  was  generated  from  the 
truncated  correlated  normal  having  p  =  0.5  and  L  —  2.  We  used  an  initial  guess  of  po  =  p  —  0.5,  and 
a  univariate  maximization  routine  with  maximum  search  window  ±0.1  (recall  -1  /(k  —  1)  <  p  <  1 
in  order  for  E  to  be  nonsingular).  Running  the  MCMLE  algorithm  for  t  =  6  iterations  (and  us¬ 
ing  f?2  =  10,000  replications  at  iteration  6)  we  obtained  the  MLE  p  =  0.743  and  an  associated 
approximate  standard  deviation  (computed  numerically  using  second  differences)  of  0.126.  In  this 
example,  convergence  is  rapid  even  for  poor  initial  choices  of  po . 


4.2  Constrained  2x2  Table 

One  of  the  datasets  presented  by  Andrews  and  Herzberg  (1985)  concerns  species  composition  in 
a  continuous,  roughly  semicircular  arc  of  woodlands  near  Bradford,  England.  The  data,  collected 
by  students  at  the  University  of  Eradford,  record  counts  of  various  kinds  of  trees  at  several  sites 
within  each  of  the  woodlands.  Table  1  gives  the  numbers  of  oak  and  sycamore  trees  in  two  such  sites 
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from  different  woodlands  (one  in  Royd’s  Cliffe  and  one  in  Dixon’s  Wood).  However,  this  selection 
was  not  done  completely  arbitrarily:  Royd’s  Cliffe  sites  having  no  oak  trees  were  not  eligible  for 
selection. 


Site  1  (Royd’s  Cliffe) 
Site  2  (Dixon’s  Wood) 


Table  1:  Restricted  multinomial  data:  Tree  counts  in  two  woodland  sites 


Oaks  Sycamores 


2 

3 

to 

8 

Assuming  the  observations  in  this  table  arise  as  independent  Poisson  variables,  it  is  customary 
to  condition  on  the  total  count  n.  This  results  in  a  multinomial  M(n\  0 )  model  with  n  =  15  and 
0  —  (911,912,921,922),  but  under  the  restriction  that  xn  >  0.  This  is  a  special  case  of  the  class 
of  restricted  categorical  data  models  considered  in  Subsection  2.2.  Direct  standardization  of  the 
likelihood  via  simple  summation  would  be  possible  in  this  trivariate  problem,  but  a  very  tedious 
accounting  problem  indeed.  We  shall  use  the  Monte  Carlo  approach  to  find  maximum  likelihood 
estimates  for  the  g^’s  and  the  odds  ratio  R  =  (9n922)/(9i292i)>  aad  compare  these  results  to  those 
obtained  presuming  an  unrestricted  model. 

The  unstandardized  likelihood  for  this  model  is 


Cl(x;  0)  OC  9lll9l2S9211922  *U  115  121  l{l,...,n}(xll)l{0,...,n}(a:12)l{0 . n}(I2l)  , 


so  that  the  Monte  Carlo  log  likelihood  (9)  becomes 


*11  log  9n  +  *12  log  912  +  xn  log  92;  +  (n  -  in  -  ®t2  -  X21 )  log  922 

-logEfe 


•t  W  *11  j  m7lj  "-Ui  J;  -*J1, 

a.-*- 

I  .  uy.  Wi  "  "11<  Jii 

L*u,o  ?u,o  ?n,o  ?«,o 


where  the  xj  values  have  been  generated  from  our  truncated  multinomial  model  conditional  on  the 
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parameter  values  3n,o.  312,0?  ?2i,o>  and  ?22,o«  Generating  multinomial  observations  and  rejecting 
those  having  =  0  is  a  crude  but  effective  way  of  drawing  the  x*. 

Convergence  of  the  MCMLE  algorithm  was  again  rapid:  only  i  —  3  iterations  were  required  to 
obtain  the  estimates  qu  =  0.111,  312  —  0.204,  and  321  =  0.137,  with  associated  standard  deviations 
of  0.096, 0.108,  and  0.091,  respectively  (again  arising  from  a  numerically  computed  Hessian).  These 
results  again  used  B-i  =  10,000  Monte  Carlo  replications  at  the  final  iteration.  The  fact  that 
qu  <  321  is  intuitively  plausible,  since  these  two  cells  produced  the  same  observed  counts  but, 
unlike  121,  ®n  could  not  have  been  0.  The  unrestricted  MLE’s  in  this  case  are  of  course  2/15  = 
0.133,  3/15  =  0.2,  and  2/15  =  0.133,  respectively.  The  discrepancy  for  the  odds  ratio  R  is  more 
pronounced:  while  the  raw  sample  odds  ratio  is  2.7,  its  MLE  under  the  restricted  model  is  only 
2.2  (estimated  standard  deviation  3.0).  The  MCMLE  algorithm’s  ability  to  produce  results  in  such 
settings  enables  comparison  of  standard  models  with  interesting  but  unwieldy  truncated  ones. 
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Abstract 


In  statistical  models  involving  constrained  or  missing  data,  likelihoods  containing  integrals 
emerge.  In  the  case  of  both  constrained  and  missing  data,  the  result  is  a  ratio  of  integrals,  which 
for  multivariate  data  may  defy,  exact  or  approximate  analytic  expression.  Seeking  maximum  like¬ 
lihood  estimates  in  such  settings,  we  propose  Monte  Cario  apprarimants  for  these  integrals,  and 
subsequently  maximize  the  resulting  approximate  likelihood.  Iteration  of  this  strategy  expedites 
the  maximization,  while  the  Gibbs  sampler  is  nsefnl  for  the  required  Monte  Cario  generation.  As 
a  result,  we  handle  a  class  of  models  broader  than  the  customary  EM  setting  without  using  an 
EM-type  algorithm.  Implementation  of  the  methodology  is  illustrated  in  two  numerical  examples. 


